Conversation
brian-dellabetta
left a comment
Definitely looks cleaner this way! Leaving comments rather than approving, as I am still getting up to speed with pipelines
brian-dellabetta
left a comment
I know you're looking for feedback on this, but I'm not sure I understand it enough to approve. I do like the removal of all the try/catch code in GPTQ. Maybe we can have a deep dive session on this next week?
## Purpose ##

* Revert the behavior regression introduced as a result of #1114
* When calibrating a model using the `QuantizationModifier`, quantization should be enabled during calibration

## Changes ##

* Remove "disabling quantization" from the calibration forward pass
* Add "disabling quantization" to the sequential pipelines in order to continue disabling quantization during calibration for GPTQ and SGPT
* When [calibration pipelines become shared between modifiers](#1279), the decision of whether to disable quantization during calibration will have to be moved to the calibration pipelines themselves. Some work needs to be done to demonstrate that GPTQ and SGPT do not suffer an accuracy regression from enabling activation quantization during calibration (in theory, the change should increase accuracy)

---------

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
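The split described above, where the `QuantizationModifier` calibrates with quantization enabled but the GPTQ/SGPT sequential pipelines still disable it, can be illustrated with a minimal sketch. All names here (`FakeQuantLayer`, `disable_quantization`, the two calibrate helpers) are hypothetical stand-ins, not llm-compressor APIs:

```python
from contextlib import contextmanager

class FakeQuantLayer:
    """Toy stand-in for a layer whose fake-quantization can be toggled."""
    def __init__(self):
        self.quantization_enabled = True
        self.calls_with_quant = 0
        self.calls_without_quant = 0

    def forward(self, x):
        if self.quantization_enabled:
            self.calls_with_quant += 1
            return round(x)  # crude stand-in for fake-quantize
        self.calls_without_quant += 1
        return x

@contextmanager
def disable_quantization(layers):
    """Temporarily disable quantization, restoring the prior state on exit."""
    prior = [layer.quantization_enabled for layer in layers]
    for layer in layers:
        layer.quantization_enabled = False
    try:
        yield
    finally:
        for layer, state in zip(layers, prior):
            layer.quantization_enabled = state

def sequential_calibrate(layers, batches):
    # GPTQ/SGPT-style sequential pipelines keep disabling quantization
    # around their calibration forward passes
    with disable_quantization(layers):
        for x in batches:
            for layer in layers:
                x = layer.forward(x)

def quant_modifier_calibrate(layers, batches):
    # QuantizationModifier calibration leaves quantization enabled
    # (the behavior restored by this PR)
    for x in batches:
        for layer in layers:
            x = layer.forward(x)
```

The key point is that "disabling quantization" now lives in the pipeline (the `with` block) rather than in the shared calibration forward pass, so each pipeline can decide independently.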
Looks like there's one perplexity failure, although I wasn't able to replicate it locally: https://github.com/neuralmagic/llm-compressor-testing/actions/runs/14622080074/job/41024772318#step:13:31981
brian-dellabetta
left a comment
Really cool! Excited to try this out. Should we run the e2e/lmeval tests before merging this in? With lots of moving pieces, they might catch something.
I've validated that the previously failing awq e2e test passes locally.
rahul-tuli
left a comment
Looks good overall, left some minor comments.
One change/resolution/explanation requested for independent pipelines.
Generally I see a lot of similar TODOs scattered across multiple files; I'd like to address them, delete them, or link them out to tickets or issues before merge.
Great job on this!
Looks like tests passed, but there was an issue reporting timings, possibly expected for this manual run and unrelated to these changes. So I think we're good to go on this!
## Purpose ##

* Extract data pipelines from modifiers to enable multiple modifiers to be active at the same time
* This enables faster compression of larger models
* This enables more memory-efficient compression of larger models (not limited to just GPTQ/SGPT)

## Prerequisites ##

* `QuantizationMixin` #1351
* `align_module_device` util #1298

## Callback Changes ##

* Implement `calibration_epoch_start`
  * This callback should be called at the start of every calibration pipeline
  * This callback causes modifiers to attach hooks
* Implement `sequential_epoch_end`
  * This callback should be called after one sequential layer has been calibrated with one epoch
  * This callback triggers compression and replaces passing a `callback_modifier`
* Implement `calibration_epoch_end`
  * This callback triggers at the end of a calibration epoch, and is used to *trigger compression* in between pipelines composed using the independent pipeline and *remove hooks* in between independent pipelines

## Lifecycle Changes ##

* Oneshot modifiers implement `on_end`, which removes hooks when calibration finishes
* In the future, `calibration_epoch_start` is treated like `batch_start`, where it is an opportunity for modifiers to start
* In the future, `calibration_epoch_end` is treated like `batch_end`, where it is an opportunity for modifiers to end
* In the future, `finalize` is treated like `batch_end`, where it is an opportunity for modifiers to end
* Right now, these opportunities are implemented manually on each oneshot modifier, rather than being a lifecycle rule

## Data Pipeline Changes ##

* Implement data pipeline registry
  * The inferred pipeline is selected using modifiers and can be overridden by the user
* Implement independent pipeline
  * This pipeline treats each modifier as a separate stage and assigns a pipeline to each modifier
  * Meant to replicate current LC behavior
  * Originally, these compression events were triggered by reaching the end of each module's initialize function. Now a separate event is required
* Implement `session.get_modifiers`
  * In order to perform data pipeline inference and other sequential pipeline inference, these functions must get the list of active modifiers before they initialize
  * This function gets all the active modifiers across all `ModifierStages`
* Prepare smoothquant for pipeline extraction
  * Trigger `_apply_smoothing` on the `sequential_epoch_end` and `calibration_epoch_end` events
  * Add a [guard](https://github.com/vllm-project/llm-compressor/pull/1244/files#diff-90bb5efcbf5f23ba1db62664a91f6b2d6492a909c387cd82c1589f45d5e8615cR285) which allows the `_apply_smoothing` function to be called multiple times per session (as is required by the sequential pipeline)

## Testing ##

* Quantized llama3-8b using both the independent (basic + sequential) and sequential pipelines
* There was no accuracy regression from using a shared pipeline, although we keep the `independent` pipeline as the default for now
* Transformers tests pass
* https://github.com/neuralmagic/llm-compressor-testing/actions/runs/14622080074

---------

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>
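The callback lifecycle above (attach hooks at `calibration_epoch_start`, compress on `sequential_epoch_end` and `calibration_epoch_end`, remove hooks when the epoch ends) can be sketched as a toy event loop. This is a minimal illustration only; `ToyModifier` and `independent_pipeline` are hypothetical names, not the project's actual classes, and only the event names come from the PR description:

```python
class ToyModifier:
    """Toy modifier that reacts to calibration lifecycle events."""
    def __init__(self, name):
        self.name = name
        self.hooks_attached = False
        self.compress_count = 0

    def on_event(self, event):
        if event == "calibration_epoch_start":
            self.hooks_attached = True   # attach observation hooks
        elif event in ("sequential_epoch_end", "calibration_epoch_end"):
            self.compress_count += 1     # trigger compression at this point
        if event == "calibration_epoch_end":
            self.hooks_attached = False  # remove hooks between pipelines

def independent_pipeline(modifiers, num_layers):
    """Treat each modifier as its own stage with its own calibration epoch."""
    log = []
    for mod in modifiers:
        mod.on_event("calibration_epoch_start")
        for _ in range(num_layers):          # one event per sequential layer
            mod.on_event("sequential_epoch_end")
        mod.on_event("calibration_epoch_end")
        log.append((mod.name, mod.compress_count))
    return log
```

A shared sequential pipeline would instead fire one set of events for all active modifiers at once, which is what lets multiple modifiers compress in a single pass over the model.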